9/8/2020

Agenda for today

  • Regression examples
  • Classification example
  • Measuring accuracy
  • Bias-variance tradeoff

Temperature check

Regression examples

Equations from data

So far, we have learned how to visualize and create numerical summaries of data.


During this course, we will go further: we want to fit an explicit equation, called a regression model, that describes how one variable (the response \(y\)) changes as a function of other variables (the predictors \(X\)).


This is called regression, curve fitting or supervised learning: estimating a best guess for \(y\) given \(X\), namely the conditional expected value \(\mathbb{E}[y \mid X]\).

Equations from data

For example, you might have heard the following rule of thumb: to calculate your maximum heart rate, subtract your age from \(220\).

We can express this rule as an equation: \[\text{MHR} = 220 - \text{Age}\]

This equation comes from data. The study was probably along these lines:

  • recruit people of varying ages
  • give them heart rate monitors
  • tell them to run fast on a treadmill
  • record their maximum heart rate

Equations from data

Data from this study would look like this:

Equations from data

It turns out that a second equation, \(\text{MHR} = 208 - 0.7 \times \text{Age}\), is better: it makes smaller errors, on average.

Supervised learning

  • Outcome measurement \(Y\), also called response, target, dependent variable
  • Vector of \(p\) predictor measurements \(X\), also called inputs, regressors, covariates, features, independent variables
  • In the regression problem, \(Y\) is quantitative (e.g. price, blood pressure)
  • In the classification problem, \(Y\) is qualitative, i.e. takes values in a finite, unordered set (e.g. survived/died, digit 0-9)

Goals: given training data \(\{(x_1,y_1),\dots,(x_n,y_n)\}\)

  1. accurately predict unseen test cases
  2. understand which inputs affect the outcome, and how
  3. make fair comparisons that adjust for the systematic effect of some variable

Making a prediction

Alice is 28. What is her predicted max heart rate?

Our equation expresses the conditional expected value of MHR, given a known value of age:

\[\mathbb{E}[\text{MHR} \mid \text{Age} = 28] = 208 - 0.7 \times 28 = 188.4\]

This is our best guess without actually putting Alice on a treadmill test until she vomits.
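
A minimal sketch of this prediction in code (the intercept and slope are the ones from the equation above):

```python
def predicted_mhr(age):
    """Expected max heart rate given age, using MHR = 208 - 0.7 * Age."""
    return 208 - 0.7 * age

print(predicted_mhr(28))  # ~188.4, our best guess for Alice
```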

Making a prediction

Understand how inputs affect the outcome

How does max heart rate change with age?

\[\mathbb{E}[\text{MHR} \mid \text{Age}] = 208 - 0.7 \times \text{Age}\]

So max heart rate declines by about 0.7 BPM, on average, with every additional year of age.

There is no guarantee that your MHR will decline at this rate; it is just a population-level average.

Make fair comparisons

A common use of regression models is to make fair comparisons by adjusting for the systematic effect of some common variable.

In this case we can adjust for how high we expect each person's MHR to be, given their age.

Let us compare two people whose max heart rates are measured using an actual treadmill test:

  • Alice is 28 with a MHR of 185
  • Abigail is 55 with a MHR of 174

Clearly, Alice has a higher MHR, but let’s make things fair. We need to give Abigail a head start, since max heart rate declines with age.

Make fair comparisons

So, who has a higher maximum heart rate for their age? Key idea: compare actual MHR with expected MHR.

Alice’s actual MHR is 185, versus an expected MHR of 188.4

\[\begin{split} \text{Actual} - \text{Predicted} &= 185 - (208 - 0.7 \times 28) \\ &= 185 - 188.4 = -3.4 \end{split}\]

Abigail’s actual MHR is 174, versus an expected MHR of 169.5

\[\begin{split} \text{Actual} - \text{Predicted} &= 174 - (208 - 0.7 \times 55) \\ &= 174 - 169.5 = 4.5 \end{split}\]

This is what statisticians mean by:

  • Adjusting for \(X\)
  • Statistically controlling for \(X\)
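
The same adjustment as a short sketch (ages and measured MHRs as given above):

```python
def predicted_mhr(age):
    # Population-level expectation: MHR = 208 - 0.7 * Age
    return 208 - 0.7 * age

people = {"Alice": (28, 185), "Abigail": (55, 174)}
for name, (age, actual) in people.items():
    residual = actual - predicted_mhr(age)  # actual minus expected for that age
    print(f"{name}: {residual:+.1f} BPM relative to expectation")
# Abigail wins the age-adjusted comparison: +4.5 vs Alice's -3.4
```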

Multiple inputs

  • Shown are Sales vs TV, Radio and Newspaper, with a blue linear regression line fit separately to each
  • Can we predict Sales using these three?
  • Perhaps we can do better using a model \[\text{Sales} \approx f(\text{TV, Radio, Newspaper})\]

Multiple inputs

  • Here \(Y = \text{Sales}\) is the response or target variable, that we wish to predict
  • TV is a feature; we name it \(X_1\)
  • Likewise name Radio as \(X_2\), and Newspaper as \(X_3\)
  • We can refer to the input vector collectively as \(\mathbf{X} = (X_1, X_2, X_3)^\intercal\)
  • Now we write our model as \[Y = f(\mathbf{X}) + \varepsilon\] where \(\varepsilon\) captures measurement errors and other discrepancies
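
To make this concrete, here is a sketch of fitting a linear \(f\) by least squares; the data below are synthetic stand-ins, not the real advertising dataset:

```python
import numpy as np

# Synthetic stand-in data: columns are TV, Radio, Newspaper budgets
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
eps = rng.normal(0, 2, size=200)                 # measurement error and other discrepancies
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + eps   # Newspaper has no true effect here

# Least-squares fit of a linear f, with a column of ones for the intercept
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)                                      # roughly [3.0, 0.05, 0.2, 0.0]
```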

Multiple inputs

The goals are the same as before:

  • With a good \(f\) we can make predictions of \(Y\) at new points \(\mathbf{X} = \mathbf{x}\)
  • We can understand which components of \(\mathbf{X} = (X_1,X_2, \dots,X_p)\) are important in explaining \(Y\), and which are irrelevant
  • Depending on the complexity of \(f\), we may be able to understand how each component \(X_j\) of \(\mathbf{X}\) affects \(Y\)

Regression example

  • We saw that the optimal solution, in terms of mean-squared error, is the regression function \(f(x) = \mathbb{E}[Y \mid x]\).
  • \(\varepsilon = Y - f(x)\) is the irreducible error, i.e. even if we knew the “true” \(f(x)\), we would still make errors in prediction since at each \(x\) there is typically a distribution of possible \(Y\) values.
  • For any estimate \(\hat{f}(x)\) of \({f}(x)\), we have \[\mathbb{E}[(Y - \hat{f}(x))^2 \mid x] = \underbrace{[f(x) - \hat{f}(x)]^2}_{\text{reducible}} + \underbrace{\text{Var}(\varepsilon)}_{\text{irreducible}}\]
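
A small simulation illustrating the decomposition (all numbers here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(x)             # the "true" regression function
sigma = 0.5                         # sd of the irreducible noise

x0 = 1.0
y = f(x0) + rng.normal(0, sigma, size=100_000)   # many draws of Y at the same x

f_hat = lambda x: np.sin(x) + 0.3   # a deliberately wrong estimate of f

mse = np.mean((y - f_hat(x0)) ** 2)
reducible = (f(x0) - f_hat(x0)) ** 2             # [f(x) - f_hat(x)]^2 = 0.09
irreducible = sigma ** 2                         # Var(eps) = 0.25
print(mse, reducible + irreducible)              # agree up to simulation noise
```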

Parametric vs nonparametric models

We can make an assumption on the functional form of \(f(x)\).

  • Constant model: \(f(x) = \beta_0\)
  • Linear model: \(f(x) = \beta_0 + \beta_1 x\)
  • Quadratic model: \(f(x) = \beta_0 + \beta_1 x + \beta_2 x^2\)
  • …

Linear model for \(p\) predictors: \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} = \beta_0 + \sum_{j = 1}^{p} \beta_j x_{ij} = \mathbf{x}_{i}^{\intercal} \boldsymbol{\beta}\] where \(\mathbf{x}_i = (1, x_{i1}, \dots, x_{ip})^\intercal\) carries a leading 1, so the inner product absorbs the intercept \(\beta_0\).
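
A quick numerical check of that identity (arbitrary made-up numbers):

```python
import numpy as np

beta = np.array([2.0, 0.5, -1.0, 3.0])   # beta_0, beta_1, beta_2, beta_3
x_i = np.array([1.0, 4.0, 2.0, 0.5])     # leading 1 for the intercept

by_sum = beta[0] + sum(beta[j] * x_i[j] for j in range(1, 4))
by_dot = x_i @ beta                      # x_i^T beta
print(by_sum, by_dot)                    # both 3.5
```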

Let’s see an example of a nonparametric model: k-nearest neighbours (k-NN) regression.

k-NN for regression

[Figure slides: k-NN fits to the same data, shown repeatedly while changing \(k\) (“Let’s change \(k\)”).]
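
The slides show these fits only as figures; as a reference, here is a minimal sketch of what k-NN regression computes, assuming the usual definition (average the \(y\) values of the \(k\) nearest training points):

```python
import numpy as np

def knn_regress(x_new, x_train, y_train, k):
    """Predict y at x_new as the mean y of the k nearest training points."""
    dist = np.abs(x_train - x_new)    # 1-D distances
    nearest = np.argsort(dist)[:k]    # indices of the k closest points
    return y_train[nearest].mean()

# Toy data: noisy sine curve
rng = np.random.default_rng(2)
x = rng.uniform(0, 6, 50)
y = np.sin(x) + rng.normal(0, 0.3, 50)

for k in (1, 5, 25):                  # small k: wiggly fit; large k: smooth fit
    print(k, knn_regress(3.0, x, y, k))
```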

Classification example

The setting

Here the response variable \(Y\) is qualitative, e.g. e-mail is one of \(\mathcal{C} = \{\text{spam}, \text{not spam}\}\), digit class is one of \(\mathcal{C} = \{0, 1, \dots, 9\}\).

Our goals are to:

  • Build a classifier \(C(X)\) that assigns a class label from \(\mathcal{C}\) to a future unlabeled observation \(X\)
  • Assess the uncertainty in each classification
  • Understand the roles of the different predictors among \(X = (X_1, X_2, \dots, X_p)\)

A first simple algorithm

[Figure slides.]
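
The slides show this algorithm only in pictures; one natural candidate for a “first simple algorithm” is k-NN with majority vote, sketched below under that assumption (2-D inputs, Euclidean distance):

```python
import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, y_train, k=5):
    """Assign the majority class among the k nearest training points."""
    dist = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dist)[:k]
    label, count = Counter(y_train[nearest]).most_common(1)[0]
    return label, count / k                          # class and a crude confidence

# Toy data: two Gaussian blobs labeled "not spam" / "spam"
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array(["not spam"] * 30 + ["spam"] * 30)
print(knn_classify(np.array([2.5, 2.5]), X, y))
```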

The bias-variance tradeoff

How to assess model accuracy?

Using the training data, a standard measure of accuracy is the mean-squared error \[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n}\left\{y_i - \hat{f}(x_i) \right\}^2\] This measure tells us how large, on average, the “mistakes” (errors) made by the model are.
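
In code, the formula is a direct transcription (reusing Alice’s and Abigail’s numbers from earlier as a toy example):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: the average squared mistake."""
    return np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)

print(mse([185, 174], [188.4, 169.5]))  # ((-3.4)^2 + 4.5^2) / 2 = 15.905
```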

How to assess model accuracy?

The figure slides fit models of increasing flexibility to the same data:

\[Y = \beta_0 + \beta_1 x\]

\[Y = \beta_0 + \beta_1 x + \beta_2 x^2\]

\[Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\]

\[Y = ???\]

How to assess model accuracy?

  • As we have seen in the examples above, there are lots of options in estimating \(f(X)\)

  • Some methods are very flexible, some are not. Why would we ever choose a less flexible model?

    1. Simple, more restrictive methods are usually easier to interpret
    2. More importantly, it is often the case that simpler models are more accurate in making future predictions

Not too simple, but not too complex!

How to assess model accuracy?

Suppose we fit a model \(\hat{f}(x)\) to some training data \(\text{Tr} = \{x_i, y_i\}_{i=1}^{n}\), and we wish to see how well it performs.

We could compute the average squared prediction error over Tr, i.e.  \[\text{MSE}_{Tr} = \frac{1}{n} \sum_{i \in Tr} \left\{y_i - \hat{f}(x_i) \right\}^2\]

This is usually biased in favour of more complex models: a flexible \(\hat{f}\) can chase the noise in the training data.

Instead we should, if possible, compute it using “fresh” test data \(\text{Te} = \{x_i^\star, y_i^\star\}_{i = 1}^{m}\), i.e. \[\text{MSE}_{Te} = \frac{1}{m} \sum_{i \in Te} \left\{y_i^\star - \hat{f}(x_i^\star) \right\}^2\]
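
A sketch of this train/test comparison on simulated data (everything below is synthetic; degrees chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * x)               # the "true" f, unknown in practice

def simulate(n):
    x = rng.uniform(0, 3, n)
    return x, f(x) + rng.normal(0, 0.3, n)

x_tr, y_tr = simulate(50)                 # training set Tr
x_te, y_te = simulate(1000)               # "fresh" test set Te

for degree in (1, 3, 10, 15):
    coeffs = np.polyfit(x_tr, y_tr, degree)          # least-squares polynomial fit
    mse_tr = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coeffs, x_te)) ** 2)
    print(f"degree {degree:2d}: MSE_Tr = {mse_tr:.3f}, MSE_Te = {mse_te:.3f}")
# MSE_Tr keeps shrinking as the degree grows; MSE_Te eventually stops improving
```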

How to assess model accuracy?

What if I was hiding some test data?

[Figure slides: the fitted models above, evaluated on held-out test points.]

General pattern

As flexibility grows, \(\text{MSE}_{Tr}\) keeps decreasing, while \(\text{MSE}_{Te}\) first falls and then rises again: a U-shape.

The bias-variance tradeoff

Bias: how far, on average, the model’s predictions are from the true values.

Variance: how much the predicted values change when the model is refit to new data.
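
For squared error these two pieces, plus the irreducible noise, add up exactly; the standard decomposition (a known identity, stated here without proof) is

\[\mathbb{E}\left[(y_0 - \hat{f}(x_0))^2\right] = \text{Var}(\hat{f}(x_0)) + \left[\text{Bias}(\hat{f}(x_0))\right]^2 + \text{Var}(\varepsilon)\]

where the expectation averages over training sets and over the new observation \(y_0\).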

More to come in this course!

The bias-variance tradeoff

  • Why is under-fitting bad?
  • Why is over-fitting bad?
  • How do we know when the fit is just right?

Typically as the flexibility (complexity) of \(f(x)\) increases, its variance increases, and its bias decreases.

To control this behaviour, we should penalize models that are too complex.

How do we quantify this penalty? Based on average test error, not on training error.

The flexibility-interpretability tradeoff

There is a natural tradeoff between:

  • Inference: understanding the way that \(Y\) is affected as \(X\) changes: “Which media contribute to sales?”; “Which media generate the biggest boost in sales?”
  • Prediction: predict the value of \(Y\) for new possible values of \(X\)

Recap

Today we saw:

  • General framework of supervised learning
  • Graphical interpretation of linear regression
  • How to compare different models (bias-variance tradeoff)


Next time:

  • How to actually find the regression line?
  • How to quantify model accuracy?
  • How to quantify uncertainty around the predictions?

Question time